
    On the complementary nature of ANOVA simultaneous component analysis (ASCA+) and Tucker3 tensor decompositions on designed multi-way datasets

    The complementary nature of analysis of variance (ANOVA) Simultaneous Component Analysis (ASCA+) and Tucker3 tensor decompositions is demonstrated on designed datasets. We show how ASCA+ can be used to (a) identify statistically sufficient Tucker3 models; (b) identify statistically important triads, making their interpretation easier; and (c) eliminate non-significant triads, making visualization and interpretation simpler. For multivariate datasets with an experimental design of at least two factors, the data matrix can be folded into a multi-way tensor. ASCA+ can be used on the unfolded matrix, and Tucker3 modeling can be used on the folded matrix (tensor). Two novel strategies are reported to determine the statistical significance of Tucker3 models using a previously published dataset. A statistically sufficient model was created by adding factors to the Tucker3 model in a stepwise manner until no ASCA+-detectable structure was observed in the residuals. Bootstrap analysis of the Tucker3 model residuals was used to determine confidence intervals for the loadings and the individual elements of the core matrix, and showed that 21 out of 63 core values of the 3 × 7 × 3 model were not significant at the 95% confidence level. Exploiting the mutual orthogonality of the 63 triads of the Tucker3 model, these 21 factors (triads) were removed from the model. An ASCA+ backward elimination strategy is reported to further simplify the Tucker3 3 × 7 × 3 model to 36 core values and associated triads. ASCA+ was also used to identify individual factors (triads) with selective responses on experimental factors A, B, or their interaction, A × B, for improved model visualization and interpretation.
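
    As a rough illustration of the ASCA side of this workflow, the sketch below partitions a balanced two-factor designed data matrix into effect matrices (grand mean, factor A, factor B, A × B interaction, residuals) and runs an SVD-based PCA on each effect matrix. All data sizes are invented, only NumPy is used, and the Tucker3 modeling of the folded tensor and the bootstrap significance analysis described in the abstract are not reproduced here.

```python
import numpy as np
from numpy.linalg import svd

rng = np.random.default_rng(0)

# Balanced two-factor design: factor A with 3 levels, factor B with 4 levels,
# 5 replicates per cell, 20 measured variables (all sizes are illustrative).
a_levels, b_levels, reps, n_vars = 3, 4, 5, 20
A = np.repeat(np.arange(a_levels), b_levels * reps)
B = np.tile(np.repeat(np.arange(b_levels), reps), a_levels)
X = rng.normal(size=(a_levels * b_levels * reps, n_vars))

# ASCA-style partition: X = grand mean + A effect + B effect + AB interaction + E
Xc = X - X.mean(axis=0)                      # remove the grand mean
XA, XB, XAB = np.zeros_like(Xc), np.zeros_like(Xc), np.zeros_like(Xc)
for a in range(a_levels):
    XA[A == a] = Xc[A == a].mean(axis=0)
for b in range(b_levels):
    XB[B == b] = Xc[B == b].mean(axis=0)
for a in range(a_levels):
    for b in range(b_levels):
        cell = (A == a) & (B == b)
        XAB[cell] = Xc[cell].mean(axis=0) - XA[cell][0] - XB[cell][0]
E = Xc - XA - XB - XAB                       # residuals

# PCA (via SVD) of each effect matrix gives the simultaneous components
for name, M in [("A", XA), ("B", XB), ("AxB", XAB), ("residuals", E)]:
    s = svd(M, full_matrices=False)[1]
    print(name, "variance captured by first 2 PCs:",
          round(float((s[:2] ** 2).sum() / (s ** 2).sum()), 3))
```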

    Cross-validation in PCA models with the element-wise k-fold (ekf) algorithm: theoretical aspects

    Cross-validation has become one of the principal methods to adjust the meta-parameters in predictive models. Extensions of the cross-validation idea have been proposed to select the number of components in principal component analysis (PCA). The element-wise k-fold (ekf) cross-validation is among the most used algorithms for PCA cross-validation. It is the method programmed in the PLS_Toolbox, and it has been stated to outperform other methods under most circumstances in a numerical experiment. The ekf algorithm is based on missing data imputation, and it can be programmed using any method for this purpose. In this paper, the ekf algorithm with the simplest missing data imputation method, trimmed score imputation, is analyzed. A theoretical study is carried out to identify the situations in which the application of ekf is adequate and, more importantly, those in which it is not. The results presented show that the ekf method may be unable to assess the extent to which a model represents a test set and may lead to discarding principal components with important information. In a second paper of this series, other imputation methods are studied within the ekf algorithm. Research in this area is partially supported by the Spanish Ministry of Economy and Competitiveness and FEDER funds from the European Union through grant DPI2011-28112-C04-02. Jose Camacho was funded by the Juan de la Cierva program, Ministry of Science and Innovation, Spain. This study was carried out when Jose Camacho was at the Universidad Politecnica de Valencia and Universitat de Girona, Spain.
    Camacho Páez, J.; Ferrer Riquelme, AJ. (2012). Cross-validation in PCA models with the element-wise k-fold (ekf) algorithm: theoretical aspects. Journal of Chemometrics. 26(1):361-373. https://doi.org/10.1002/cem.2440
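
    For concreteness, here is a minimal sketch of the ekf idea with trimmed score imputation, assuming a simple per-column-group deletion pattern rather than the diagonal pattern used in practice; the data, fold counts, and PC range are all illustrative, and this is not the PLS_Toolbox implementation.

```python
import numpy as np
from numpy.linalg import svd

rng = np.random.default_rng(1)
N, M, max_pcs, k = 60, 12, 6, 4      # observations, variables, PCs tested, folds (illustrative)
X = rng.normal(size=(N, 8)) @ rng.normal(size=(8, M)) + 0.3 * rng.normal(size=(N, M))

press = np.zeros(max_pcs)
for test_rows in np.array_split(rng.permutation(N), k):     # observation-wise k-fold
    cal = np.delete(X, test_rows, axis=0)
    mu = cal.mean(axis=0)
    P_full = svd(cal - mu, full_matrices=False)[2].T        # loadings from the calibration set
    for a in range(1, max_pcs + 1):
        P = P_full[:, :a]
        for r in test_rows:
            x = X[r] - mu
            for m in np.array_split(np.arange(M), k):        # element-wise deletion groups
                o = np.setdiff1d(np.arange(M), m)
                t = P[o].T @ x[o]                            # trimmed score imputation (TRI)
                press[a - 1] += np.sum((x[m] - P[m] @ t) ** 2)

print("PRESS per number of PCs:", press.round(1))
```

    The number of PCs minimizing PRESS would then be retained; the theoretical analysis in the paper concerns situations where this TRI-based criterion can be misleading.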

    On the Generation of Random Multivariate Data

    The simulation of multivariate data is often necessary for assessing the performance of multivariate analysis techniques. The random generation of multivariate data when the covariance matrix is completely or partly specified is solved by different methods, from the Cholesky decomposition to some recent alternatives. However, the covariance matrix itself often has to be generated at random as well, so that the data simulation spans different situations, from highly correlated to uncorrelated data. This is the case when assessing a new multivariate analysis technique in Monte Carlo experiments. In this paper, we introduce a new algorithm for the generation of random data from covariance matrices of random structure, where the user only decides the data dimension and the level of correlation. We illustrate the application of this algorithm in several relevant problems in multivariate analysis, namely the selection of the number of Principal Components in Principal Component Analysis, the evaluation of the performance of sparse Partial Least Squares, and the calibration of Multivariate Statistical Process Control systems. The algorithm is available as part of the MEDA Toolbox v1.1.
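
    One possible way to generate such random-structure covariance matrices (not necessarily the algorithm implemented in the MEDA Toolbox, which is a MATLAB package) is to draw random orthogonal eigenvectors and control the correlation level through the decay of the eigenvalues, as in the hedged sketch below; the function name and the decay parameterization are illustrative choices.

```python
import numpy as np
from scipy.stats import ortho_group

def random_correlated_data(n_obs, n_vars, decay=1.0, seed=0):
    """Sample n_obs observations from a random-structure covariance matrix.

    decay controls the correlation level: 0 gives roughly uncorrelated
    variables, larger values concentrate the variance in a few directions.
    """
    rng = np.random.default_rng(seed)
    Q = ortho_group.rvs(n_vars, random_state=seed)           # random eigenvectors
    eigvals = np.exp(-decay * np.arange(n_vars))              # decaying eigenvalues
    eigvals *= n_vars / eigvals.sum()                         # keep the total variance fixed
    cov = Q @ np.diag(eigvals) @ Q.T
    L = np.linalg.cholesky(cov + 1e-10 * np.eye(n_vars))      # jitter for numerical safety
    return rng.normal(size=(n_obs, n_vars)) @ L.T, cov

X, cov = random_correlated_data(n_obs=500, n_vars=8, decay=0.7)
print(np.abs(np.corrcoef(X, rowvar=False)).mean().round(2))   # rough correlation level
```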

    Variable-selection ANOVA Simultaneous Component Analysis (VASCA)

    Motivation: ANOVA Simultaneous Component Analysis (ASCA) is a popular method for the analysis of multivariate data yielded by designed experiments. Meaningful associations between factors/interactions of the experimental design and measured variables in the dataset are typically identified via significance testing, with permutation tests being the standard go-to choice. However, in settings with large numbers of variables, like omics (genomics, transcriptomics, proteomics and metabolomics) experiments, the ‘holistic’ testing approach of ASCA (all variables considered) often overlooks statistically significant effects encoded by only a few variables (biomarkers). Results: We hereby propose Variable-selection ASCA (VASCA), a method that generalizes ASCA through variable selection, augmenting its statistical power without inflating the Type-I error risk. The method is evaluated with simulations and with a real dataset from a multi-omic clinical experiment. We show that VASCA is more powerful than both ASCA and the widely adopted false discovery rate (FDR) controlling procedure; the latter is used as a benchmark for variable selection based on multiple significance testing. We further illustrate the usefulness of VASCA for exploratory data analysis in comparison to the popular partial least squares discriminant analysis method and its sparse counterpart. Funding: Agencia Andaluza del Conocimiento, Regional Government of Andalucia, Spain; European Commission (B-TIC-136-UGR20); State Research Agency (AEI) of Spain; European Social Fund (ESF) (RYC2020-030536-I); AEI (PID2020-118139RB-I0).
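
    The sketch below illustrates only one ingredient that VASCA builds on, per-variable permutation testing for a single factor, together with the Benjamini-Hochberg procedure used here as the FDR benchmark mentioned in the abstract; it is not the VASCA algorithm itself, and the two-group design, effect size, and variable counts are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n_per_group, n_vars, n_perm = 20, 200, 500
groups = np.repeat([0, 1], n_per_group)
X = rng.normal(size=(2 * n_per_group, n_vars))
X[groups == 1, :5] += 1.0                     # only 5 "biomarker" variables carry the effect

def effect(X, g):
    # per-variable effect size for a two-level factor: absolute mean difference
    return np.abs(X[g == 1].mean(axis=0) - X[g == 0].mean(axis=0))

obs = effect(X, groups)
null = np.array([effect(X, rng.permutation(groups)) for _ in range(n_perm)])
pvals = (1 + (null >= obs).sum(axis=0)) / (n_perm + 1)    # per-variable permutation p-values

# Benjamini-Hochberg at 5% FDR (the multiple-testing benchmark in the abstract)
order = np.argsort(pvals)
passed = np.nonzero(pvals[order] <= 0.05 * np.arange(1, n_vars + 1) / n_vars)[0]
n_sel = passed.max() + 1 if passed.size else 0
print("variables selected at 5% FDR:", np.sort(order[:n_sel]))
```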

    Group-wise Partial Least Square Regression

    This paper introduces the Group-wise Partial Least Squares (GPLS) regression. GPLS is a new Sparse PLS (SPLS) technique where the sparsity structure is defined in terms of groups of correlated variables, similarly to what is done in the related Group-wise Principal Component Analysis (GPCA). These groups are found in correlation maps derived from the data to be analyzed. GPLS is especially useful for exploratory data analysis, since suitable values for its metaparameters can be inferred upon visualization of the correlation maps. Following this approach, we show GPLS solves an inherent problem of SPLS: its tendency to confound the data structure as a result of setting its metaparameters using standard approaches for optimizing prediction, like cross-validation. Results are shown for both simulated and experimental data.
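
    The following is a loose sketch of the group-from-correlation-map idea behind GPLS, assuming groups are obtained by thresholding the absolute correlation matrix and taking connected components, and that a separate one-component PLS model is fitted per group; the published GPLS algorithm differs in how the groups enter the PLS model building, and the data and the 0.6 threshold here are invented.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(3)
n, m = 100, 12
# Two latent blocks of correlated variables plus noise-only variables
z1, z2 = rng.normal(size=(n, 1)), rng.normal(size=(n, 1))
X = np.hstack([z1 + 0.2 * rng.normal(size=(n, 4)),
               z2 + 0.2 * rng.normal(size=(n, 4)),
               rng.normal(size=(n, 4))])
y = z1[:, 0] + 0.1 * rng.normal(size=n)

# "Correlation map": threshold the absolute correlation matrix and find
# connected groups of variables
R = np.abs(np.corrcoef(X, rowvar=False))
adj = csr_matrix(R > 0.6)                     # the threshold is a user choice
n_groups, labels = connected_components(adj, directed=False)

# Fit one PLS component per group and report its predictive ability (R^2)
for g in range(n_groups):
    cols = np.nonzero(labels == g)[0]
    if cols.size < 2:
        continue                              # skip singleton (noise) variables
    r2 = PLSRegression(n_components=1).fit(X[:, cols], y).score(X[:, cols], y)
    print(f"group {g} -> variables {cols}, R^2 = {r2:.2f}")
```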

    On the use of the observation-wise k-fold operation in PCA cross-validation

    Cross-validation (CV) is a common approach for determining the optimal number of components in a principal component analysis model. To guarantee the independence between model testing and calibration, the observation-wise k-fold operation is commonly implemented in each cross-validation step. This operation renders the CV algorithm computationally intensive, and it is the main limitation to applying CV to very large data sets. In this paper we carry out an empirical and theoretical investigation of the use of this operation in the element-wise k-fold (ekf) algorithm, the state-of-the-art CV algorithm. We show that when very large data sets need to be cross-validated and the computational time is a matter of concern, the observation-wise k-fold operation can be skipped. The theoretical properties of the resulting modified algorithm, referred to as the column-wise k-fold (ckf) algorithm, are derived. Also, its performance is evaluated with several artificial and real data sets. We suggest the ckf algorithm as a valid alternative to the standard ekf to reduce the computational time needed to cross-validate a data set.
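
    A minimal sketch of the column-wise idea, under the same TRI-style imputation assumed in the earlier ekf sketch: the loadings are computed only once, on the full data set, instead of once per observation fold. Data sizes and fold counts are illustrative.

```python
import numpy as np
from numpy.linalg import svd

rng = np.random.default_rng(4)
N, M, max_pcs, k = 200, 15, 8, 5
X = rng.normal(size=(N, 12)) @ rng.normal(size=(12, M)) + 0.3 * rng.normal(size=(N, M))
X = X - X.mean(axis=0)

# ckf: skip the observation-wise k-fold and fit the loadings once on all data
P_full = svd(X, full_matrices=False)[2].T

press = np.zeros(max_pcs)
for a in range(1, max_pcs + 1):
    P = P_full[:, :a]
    for m in np.array_split(np.arange(M), k):         # element-wise (column) groups
        o = np.setdiff1d(np.arange(M), m)
        T = X[:, o] @ P[o]                             # trimmed scores for all rows at once
        press[a - 1] += np.sum((X[:, m] - T @ P[m].T) ** 2)

print("ckf PRESS per number of PCs:", press.round(1))
```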

    Present and Future of Network Security Monitoring

    Network Security Monitoring (NSM) is a popular term referring to the detection of security incidents by monitoring network events. An NSM system is central to the security of current networks, given the escalation in sophistication of cyberwarfare. In this paper, we review the state of the art in NSM and derive a new taxonomy of the functionalities and modules in an NSM system. This taxonomy is useful for both researchers and practitioners to assess current NSM deployments and tools. We organize a list of popular tools according to this new taxonomy, and identify challenges in the application of NSM in modern network deployments, such as Software-Defined Networking (SDN) and the Internet of Things (IoT). This work was funded in part by the Ministry of Science and Innovation through the Ayudas Cervera para Centros Tecnologicos grant of the Spanish Centre for the Development of Industrial Technology (CDTI), Project EGIDA, under Grant CER-20191012, and in part by the Spanish Ministry of Economy and Competitiveness and European Regional Development Fund (ERDF) funds under Project TIN2017-83494-R.

    Monitorización y selección de incidentes en seguridad de redes mediante EDA [Network security incident monitoring and selection using EDA]

    One of the biggest challenges faced by network security monitoring systems is the large volume of data, of diverse nature and relevance, that must be processed for adequate presentation to the system administration team, while trying to incorporate the most relevant semantic information. In this article, we propose the application of tools derived from exploratory data analysis (EDA) techniques to select the critical events on which the administrator should focus attention. In addition, these tools are able to provide semantic information about the elements involved and their degree of implication in the selected events. The proposal is presented and evaluated using the VAST 2012 challenge as a case study, with highly satisfactory results. This work has been partially funded by MICINN through project TEC2011-22579.

    Group-Wise Principal Component Analysis for Exploratory Intrusion Detection

    Intrusion detection is a relevant layer of cybersecurity to prevent hacking and illegal activities from happening on the assets of corporations. Anomaly-based Intrusion Detection Systems perform an unsupervised analysis on data collected from the network and end systems in order to identify singular events. While this approach may produce many false alarms, it is also capable of identifying new (zero-day) security threats. In this context, the use of multivariate approaches such as Principal Component Analysis (PCA) has provided promising results in the past. PCA can be used in exploratory mode or in learning mode. Here, we propose an exploratory intrusion detection approach that replaces PCA with Group-wise PCA (GPCA), a recently proposed data analysis technique with additional exploratory characteristics. A main advantage of GPCA over PCA is that the former yields simple models that are easy to understand by security professionals not trained in multivariate tools. Besides, the workflow of intrusion detection with GPCA is more coherent with dominant strategies in intrusion detection. We illustrate the application of GPCA in two case studies. This work was supported in part by the Spanish Government-MINECO (Ministerio de Economía y Competitividad), using Fondo Europeo de Desarrollo Regional (FEDER) funds, under Projects TIN2014-60346-R and TIN2017-83494-R.

    Evaluation of Diagnosis Methods in PCA-based Multivariate Statistical Process Control

    Multivariate Statistical Process Control (MSPC) based on Principal Component Analysis (PCA) is a well-known methodology in chemometrics that is aimed at testing whether an industrial process is under Normal Operation Conditions (NOC). As part of the methodology, once an anomalous behaviour is detected, the root causes need to be diagnosed to troubleshoot the problem and/or avoid it in the future. While there have been a number of developments in diagnosis in the past decades, no sound method for comparing existing approaches has been proposed. In this paper, we propose such a procedure and use it to compare several diagnosis methods using both randomly simulated data and data from realistic sources. This is a general comparative approach that takes into account factors that have not previously been considered in the literature. The results show that univariate diagnosis is more reliable than its multivariate counterpart.
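
    As context for the kind of system being compared, here is a hedged sketch of a basic PCA-based MSPC pipeline with a univariate, residual-contribution style diagnosis: fit PCA on NOC data, monitor T² and SPE for new observations, and point to the variable with the largest squared residual. The specific diagnosis methods evaluated in the paper are not reproduced, and all numbers here are invented.

```python
import numpy as np
from numpy.linalg import svd

rng = np.random.default_rng(5)
n_noc, m, a = 200, 10, 3                        # NOC observations, variables, PCs (illustrative)
X = rng.normal(size=(n_noc, 6)) @ rng.normal(size=(6, m)) + 0.3 * rng.normal(size=(n_noc, m))
mu, sd = X.mean(axis=0), X.std(axis=0)
Xs = (X - mu) / sd

_, S, Vt = svd(Xs, full_matrices=False)
P = Vt[:a].T                                    # loadings
lam = (S[:a] ** 2) / (n_noc - 1)                # score variances

def monitor(x_new):
    """Return T2, SPE and per-variable SPE contributions for one observation."""
    z = (x_new - mu) / sd
    t = P.T @ z
    t2 = np.sum(t ** 2 / lam)
    e = z - P @ t                               # residual in the unmodelled subspace
    return t2, np.sum(e ** 2), e ** 2           # SPE contributions are squared residuals

# A faulty observation: variable 4 is shifted away from NOC behaviour
x_fault = X[0].copy()
x_fault[4] += 6 * sd[4]
t2, spe, contrib = monitor(x_fault)
print(f"T2 = {t2:.1f}, SPE = {spe:.1f}, most contributing variable: {contrib.argmax()}")
```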